You may add as many code and markdown cells as you see fit to answer the questions.
You will demonstrate your ability to merge, group, summarize, visualize, and find patterns in data. This exam uses data associated with a manufacturing example. An overview of the goals, considerations, CSV files, and variables within the data is provided in a presentation on Canvas. Please read through those slides before starting the exam.
The data are provided in 5 separate CSV files. The CSV files are available on Canvas. You MUST download the files and save them to the same working directory as this notebook.
The specific instructions in this notebook tell you when you must JOIN the data together. Please read the problems carefully.
The overall objective of this exam is to JOIN data from multiple files in order to explore and find interesting patterns between the machine operating conditions and supplier information. You will report your findings within this notebook by displaying Pandas DataFrames and statistical visualizations via Seaborn and matplotlib when necessary.
You are permitted to use the following modules on this exam.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
You may also use the following functions from scikit-learn on this exam.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
You may also use the following submodule from SciPy.
from scipy.cluster import hierarchy
You are NOT permitted to use any other modules or functions. However, you ARE permitted to create your own user defined functions if you would like.
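Since user-defined functions are allowed, a small helper can avoid repeating the same exploration code for each machine. The function name `summarize` and the toy DataFrame below are illustrative only, not part of the exam data:

```python
import pandas as pd

def summarize(df, name):
    """Return a one-row summary of shape, missing values, and unique IDs."""
    return pd.DataFrame({
        'data_set': [name],
        'rows': [df.shape[0]],
        'cols': [df.shape[1]],
        'n_missing': [int(df.isnull().sum().sum())],
        'n_unique_ids': [df['ID'].nunique() if 'ID' in df.columns else None],
    })

# toy example (not the real machine data)
toy = pd.DataFrame({'ID': ['a', 'b', 'c'], 'x1': [1.0, 2.0, None]})
print(summarize(toy, 'toy'))
```

The same helper could then be called once per machine DataFrame instead of repeating the loop bodies used below.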
The file names for the 3 machine data sets are provided as strings in the cell below. You are required to read in the CSV files and assign the data to the m01_df, m02_df, and m03_df objects. The data from machine 1 will therefore be associated with m01_df, machine 2 is associated with m02_df, and machine 3 is associated with m03_df.
In this problem you must explore each of the three machine data sets.
You must perform the following ESSENTIAL activities:
You must visually explore the MARGINAL behavior of the variables in the data. You must use visualizations appropriate for the DATA TYPE of the columns.
You must visually explore RELATIONSHIPS between variables in the data. You must use visualizations appropriate for the DATA TYPES. You must make sure that your visualizations can answer the following questions:
- How many unique values for Batch are associated with each MACHINE (data set)?
- How many cell phone cases are associated with each Batch value for each MACHINE (data set)?
- How do the summary statistics of x1 through x4 vary across the three MACHINES?
- How do the summary statistics of x1 through x4 vary across the Batch values?
- How do the relationships between x1 through x4 vary across the three MACHINES?
- How do the relationships between x1 through x4 vary across the Batch values?

At the conclusion of this problem, you MUST CONCATENATE the 3 MACHINE data sets into a single DataFrame. The single DataFrame must be named machine_df. Before concatenating, you MUST add a column machine_id to each DataFrame with the correct index value for that machine (1, 2, or 3). The concatenated DataFrame variable name is provided as a reminder to you below.
You may add as many markdown and code cells as you see fit to answer this question. Include markdown cells stating what you see in the figures and why you selected to use them.
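The required concatenation step can be sketched on toy stand-ins for the three machine DataFrames (the values are invented; only the `assign` + `concat` pattern matters):

```python
import pandas as pd

# stand-ins for m01_df, m02_df, m03_df
m01 = pd.DataFrame({'Batch': [1, 1], 'x1': [50.1, 46.9]})
m02 = pd.DataFrame({'Batch': [2, 2], 'x1': [51.0, 52.3]})
m03 = pd.DataFrame({'Batch': [3, 3], 'x1': [49.3, 49.5]})

# add machine_id to each frame before concatenating;
# ignore_index avoids duplicated row labels in the result
machine = pd.concat(
    [m01.assign(machine_id=1), m02.assign(machine_id=2), m03.assign(machine_id=3)],
    ignore_index=True
)
print(machine['machine_id'].value_counts())
```

`assign` returns a copy, so the original DataFrames are left untouched, which is convenient when the same objects are reused for other plots.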
# Define the file names for the 3 machine-level CSV files
file_m01 = 'midterm_machine_01.csv'
file_m02 = 'midterm_machine_02.csv'
file_m03 = 'midterm_machine_03.csv'
# read in the CSV files and name them accordingly
m01_df = pd.read_csv(file_m01)
m02_df = pd.read_csv(file_m02)
m03_df = pd.read_csv(file_m03)
## 1. How many rows and columns are in each data set?
dataframes = {'m01_df': m01_df, 'm02_df': m02_df, 'm03_df': m03_df}
for name, df in dataframes.items():
print(f'{name} has {df.shape[0]} rows and {df.shape[1]} columns')
m01_df has 5152 rows and 7 columns
m02_df has 5119 rows and 7 columns
m03_df has 4458 rows and 7 columns
## 2. What are the names and data types of the columns in each data set?
dataframes = {'m01_df': m01_df, 'm02_df': m02_df, 'm03_df': m03_df}
for name, df in dataframes.items():
print(f'The names and data types of the columns in {name} are \n{df.dtypes}')
The names and data types of the columns in m01_df are
ID        object
Batch      int64
s_id       int64
x1       float64
x2       float64
x3       float64
x4       float64
dtype: object
The names and data types of the columns in m02_df are
ID        object
Batch      int64
s_id       int64
x1       float64
x2       float64
x3       float64
x4       float64
dtype: object
The names and data types of the columns in m03_df are
ID        object
Batch      int64
s_id       int64
x1       float64
x2       float64
x3       float64
x4       float64
dtype: object
## 3. How many unique values are there for each column of each data set?
dataframes = {'m01_df': m01_df, 'm02_df': m02_df, 'm03_df': m03_df}
for name, df in dataframes.items():
print(f'The unique values for {name} are \n{df.nunique()}')
The unique values for m01_df are
ID       5152
Batch      50
s_id      149
x1       5152
x2       5152
x3       5152
x4       5152
dtype: int64
The unique values for m02_df are
ID       5119
Batch      50
s_id      133
x1       5119
x2       5119
x3       5119
x4       5119
dtype: int64
The unique values for m03_df are
ID       4458
Batch      41
s_id      141
x1       4458
x2       4458
x3       4458
x4       4458
dtype: int64
## 4. How many missing values are there for each column of each data set?
dataframes = {'m01_df': m01_df, 'm02_df': m02_df, 'm03_df': m03_df}
for name, df in dataframes.items():
print(f'The missing values for {name} are \n{df.isnull().sum()}')
The missing values for m01_df are
ID       0
Batch    0
s_id     0
x1       0
x2       0
x3       0
x4       0
dtype: int64
The missing values for m02_df are
ID       0
Batch    0
s_id     0
x1       0
x2       0
x3       0
x4       0
dtype: int64
The missing values for m03_df are
ID       0
Batch    0
s_id     0
x1       0
x2       0
x3       0
x4       0
dtype: int64
## Concatenate all three machine data sets (wide format)
m_df = pd.concat([m01_df, m02_df, m03_df])
## m01_df Wide to Long Format
m01_df_lf = m01_df.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=['ID','rowid','Batch','s_id'], var_name='variable', value_name='value')
m01_df_lf['machine_id'] = 'm01'
## m02_df Wide to Long Format
m02_df_lf = m02_df.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=['ID','rowid','Batch','s_id'], var_name='variable', value_name='value')
m02_df_lf['machine_id'] = 'm02'
## m03_df Wide to Long Format
m03_df_lf = m03_df.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=['ID','rowid','Batch','s_id'], var_name='variable', value_name='value')
m03_df_lf['machine_id'] = 'm03'
## Concatenate all three machine data sets in long format
m_df_lf = pd.concat([m01_df_lf, m02_df_lf, m03_df_lf])
m_df_lf
| | ID | rowid | Batch | s_id | variable | value | machine_id |
|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S001 | 0 | 1 | 1 | x1 | 50.117118 | m01 |
| 1 | B001-M01-S002 | 1 | 1 | 2 | x1 | 46.887333 | m01 |
| 2 | B001-M01-S003 | 2 | 1 | 3 | x1 | 50.132744 | m01 |
| 3 | B001-M01-S004 | 3 | 1 | 4 | x1 | 48.501042 | m01 |
| 4 | B001-M01-S005 | 4 | 1 | 5 | x1 | 49.690442 | m01 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 17827 | B049-M03-S100 | 4453 | 49 | 100 | x4 | 13.753264 | m03 |
| 17828 | B049-M03-S101 | 4454 | 49 | 101 | x4 | 13.636539 | m03 |
| 17829 | B049-M03-S102 | 4455 | 49 | 102 | x4 | 13.678206 | m03 |
| 17830 | B049-M03-S103 | 4456 | 49 | 103 | x4 | 13.880084 | m03 |
| 17831 | B049-M03-S104 | 4457 | 49 | 104 | x4 | 13.853014 | m03 |
58916 rows × 7 columns
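The wide-to-long reshape used above can be verified on a tiny example. The column names mirror the machine files, but the values here are invented:

```python
import pandas as pd

wide = pd.DataFrame({
    'ID': ['B001-S001', 'B001-S002'],
    'Batch': [1, 1],
    's_id': [1, 2],
    'x1': [50.1, 46.9],
    'x2': [102.2, 112.3],
})

# reset_index preserves the original row position as 'rowid' before melting
long = wide.reset_index().rename(columns={'index': 'rowid'}).melt(
    id_vars=['ID', 'rowid', 'Batch', 's_id'],
    var_name='variable', value_name='value'
)
# 2 rows x 2 measured columns -> 4 long-format rows
print(long.shape)
```

This confirms the row count of the long frame is (number of rows) × (number of melted columns), matching 58916 = (5152 + 5119 + 4458) × 4 for the real data.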
# 5. Explore the MARGINAL behavior of the variables in the data.
# histograms show the marginal distribution of each continuous variable for each machine
sns.displot(data=m_df_lf, x='value', col='variable', row='machine_id',
            kind='hist', facet_kws={'sharex': False, 'sharey': False})
<seaborn.axisgrid.FacetGrid at 0x229fa7bd090>
# 6. Explore the RELATIONSHIP between the variables in the data.
## 6.1. How many unique values for Batch are associated with each machine? Plot the number of unique Batch values for each machine.
unique_batches_per_machine = m_df_lf.groupby('machine_id')['Batch'].nunique()
sns.barplot(x=unique_batches_per_machine.index, y=unique_batches_per_machine.values)
## 6.2 How many cell phone cases are associated with each Batch value for each machine?
sns.displot(data=m_df_lf, x='Batch', hue='machine_id', multiple='stack', col='machine_id',bins=20)
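As a numeric cross-check of the histogram, the case counts per Batch and machine can be tabulated directly with `pd.crosstab`; sketched here on toy data (the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'Batch':      [1, 1, 1, 2, 2, 2],
    'machine_id': ['m01', 'm01', 'm02', 'm01', 'm02', 'm02'],
})
# rows = Batch, columns = machine, cells = number of cases
counts = pd.crosstab(toy['Batch'], toy['machine_id'])
print(counts)
```

Applied to `m01_df`/`m02_df`/`m03_df`, the same call would reveal exactly which Batch values are missing from machine 3.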
## 6.3. Summary statistics for each variable from x1 to x4 for each machine.
sns.catplot(data=m_df_lf, x = 'variable', y='value', hue='machine_id', kind='box', col='machine_id')
## 6.4 Summary statistics for each variable from x1 to x4 across Batch
sns.catplot(data=m_df_lf, x = 'Batch', y='value', kind='box', col='variable', height=6, aspect=2, col_wrap=1, sharey=False)
## 6.5 Relationship between x1 through x4 for each machine.
sns.pairplot(data=m01_df, vars=['x1','x2','x3','x4'], diag_kws={'common_norm':False})
sns.pairplot(data=m02_df, vars=['x1','x2','x3','x4'], diag_kws={'common_norm':False})
sns.pairplot(data=m03_df, vars=['x1','x2','x3','x4'], diag_kws={'common_norm':False})
## 6.6 Relationship between x1 through x4 across Batch, here, m_df is the sum of all machines
sns.pairplot(data=m_df, hue='Batch', vars=['x1','x2','x3','x4'], palette='coolwarm',
diag_kws={'common_norm':False})
<seaborn.axisgrid.PairGrid at 0x2298219b950>
## Further plots for s_id versus the other variables
### for m01_df
sns.catplot(data=m01_df, x='s_id', y='Batch', kind='point',aspect=10)
sns.catplot(data=m01_df, x='s_id', y='x1', kind='point',aspect=10)
sns.catplot(data=m01_df, x='s_id', y='x2', kind='point',aspect=10)
sns.catplot(data=m01_df, x='s_id', y='x3', kind='point',aspect=10)
sns.catplot(data=m01_df, x='s_id', y='x4', kind='point',aspect=10)
plt.show()
### for m02_df
sns.catplot(data=m02_df, x='s_id', y='Batch', kind='point',aspect=10)
sns.catplot(data=m02_df, x='s_id', y='x1', kind='point',aspect=10)
sns.catplot(data=m02_df, x='s_id', y='x2', kind='point',aspect=10)
sns.catplot(data=m02_df, x='s_id', y='x3', kind='point',aspect=10)
sns.catplot(data=m02_df, x='s_id', y='x4', kind='point',aspect=10)
plt.show()
### for m03_df
sns.catplot(data=m03_df, x='s_id', y='Batch', kind='point',aspect=10)
sns.catplot(data=m03_df, x='s_id', y='x1', kind='point',aspect=10)
sns.catplot(data=m03_df, x='s_id', y='x2', kind='point',aspect=10)
sns.catplot(data=m03_df, x='s_id', y='x3', kind='point',aspect=10)
sns.catplot(data=m03_df, x='s_id', y='x4', kind='point',aspect=10)
plt.show()
# concatenate the 3 DataFrames into a single DataFrame which includes the `machine_id` variable
# add a new column, if it is from m01_df, the value is 1, if it is from m02_df, the value is 2, if it is from m03_df, the value is 3
m01_df_cp = m01_df.copy()
m01_df_cp['machine_id'] = 1
m02_df_cp = m02_df.copy()
m02_df_cp['machine_id'] = 2
m03_df_cp = m03_df.copy()
m03_df_cp['machine_id'] = 3
machine_df = pd.concat([m01_df_cp, m02_df_cp, m03_df_cp])
machine_df
# Or we can use the following code to add the machine_id column
machine_df = pd.concat([m01_df.assign(machine_id=1), m02_df.assign(machine_id=2), m03_df.assign(machine_id=3)])
machine_df
| | ID | Batch | s_id | x1 | x2 | x3 | x4 | machine_id |
|---|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S001 | 1 | 1 | 50.117118 | 102.167346 | 22.067812 | 13.889524 | 1 |
| 1 | B001-M01-S002 | 1 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 |
| 2 | B001-M01-S003 | 1 | 3 | 50.132744 | 103.674908 | 22.319933 | 13.647482 | 1 |
| 3 | B001-M01-S004 | 1 | 4 | 48.501042 | 107.143156 | 22.162947 | 14.077758 | 1 |
| 4 | B001-M01-S005 | 1 | 5 | 49.690442 | 102.120283 | 22.248696 | 13.728666 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4453 | B049-M03-S100 | 49 | 100 | 49.278415 | 102.089925 | 22.396979 | 13.753264 | 3 |
| 4454 | B049-M03-S101 | 49 | 101 | 49.264754 | 104.619874 | 21.977716 | 13.636539 | 3 |
| 4455 | B049-M03-S102 | 49 | 102 | 49.465358 | 102.867473 | 22.257845 | 13.678206 | 3 |
| 4456 | B049-M03-S103 | 49 | 103 | 49.751785 | 104.863427 | 22.461204 | 13.880084 | 3 |
| 4457 | B049-M03-S104 | 49 | 104 | 49.819212 | 104.075176 | 22.462165 | 13.853014 | 3 |
14729 rows × 8 columns
The supplier batch data set file name is provided for you below. You must read in the CSV file and assign the data set to the batch_df object.
You must perform the following ESSENTIAL activities:
You must visually explore the MARGINAL behavior of the variables in the data. You must use visualizations appropriate for the DATA TYPE of the columns.
You must visually explore RELATIONSHIPS between variables in the data. You must use visualizations appropriate for the DATA TYPES. You must make sure that your visualizations can answer the following questions:
- How does the distribution of Density depend on the Supplier?
- How does the average Density depend on the Supplier?
- How does Density relate to Batch for each Supplier?

After exploring the batch_df DataFrame, you MUST JOIN/MERGE the batch_df DataFrame with the machine_df DataFrame. Assign the merged DataFrame to the dfa DataFrame.
You can now explore the relationships between the MACHINE OPERATIONAL VARIABLES and the SUPPLIERS! You must use visualizations to explore the following relationships:
- How do x1 through x4 vary across Batch for each MACHINE given each Supplier? Your figures MUST use Batch as the x-axis variable.
- How do the relationships between x1 through x4 vary across Supplier?

You may add as many markdown and code cells as you see fit to answer this question.
# define the batch supplier file
batch_file = 'midterm_supplier.csv'
# read in the batch supplier data set
batch_df = pd.read_csv(batch_file)
## 1. How many rows and columns are in the data set?
print(f'The batch_df has {batch_df.shape[0]} rows and {batch_df.shape[1]} columns')
The batch_df has 50 rows and 3 columns
## 2. What are the names and data types of the columns in the data set?
print(f'The names and data types of the columns in batch_df are \n{batch_df.dtypes}')
The names and data types of the columns in batch_df are
Batch         int64
Supplier     object
Density     float64
dtype: object
## 3. How many unique values are there for each column of the data set?
print(f'The unique values for batch_df are \n{batch_df.nunique()}')
The unique values for batch_df are
Batch       50
Supplier     2
Density     50
dtype: int64
## 4. How many missing values are there for each column of the data set?
print(f'The missing values for batch_df are \n{batch_df.isnull().sum()}')
The missing values for batch_df are
Batch       0
Supplier    0
Density     0
dtype: int64
## 5. Explore the MARGINAL behavior of the variables in the data.
marginal_batch_df = batch_df.groupby(['Supplier']).size().reset_index(name='num_rows')
sns.catplot(data = marginal_batch_df, x = 'Supplier', kind = 'bar', y = 'num_rows')
<seaborn.axisgrid.FacetGrid at 0x229ff3e22d0>
## 6. Explore the RELATIONSHIP between the variables in the data.
### 6.1. Summary statistics of Density for each Supplier.
sns.catplot(data=batch_df, x='Supplier', y='Density', kind='box',aspect=2)
plt.show()
### 6.2. Average Density for each Supplier.
sns.pointplot(data=batch_df, y='Density', x='Supplier',join=False)
plt.show()
### 6.3 Density relationship to Batch for each Supplier.
sns.pairplot(batch_df, hue='Supplier',
diag_kws={'common_norm':False})
plt.show()
merge the batch supplier data set with the (concatenated) machine data set
# 7. merge the batch supplier data set with the (concatenated) machine data set
dfa = pd.merge(machine_df, batch_df, on='Batch', how='inner')
dfa
| | ID | Batch | s_id | x1 | x2 | x3 | x4 | machine_id | Supplier | Density |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S001 | 1 | 1 | 50.117118 | 102.167346 | 22.067812 | 13.889524 | 1 | B | 10.388587 |
| 1 | B001-M01-S002 | 1 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 | B | 10.388587 |
| 2 | B001-M01-S003 | 1 | 3 | 50.132744 | 103.674908 | 22.319933 | 13.647482 | 1 | B | 10.388587 |
| 3 | B001-M01-S004 | 1 | 4 | 48.501042 | 107.143156 | 22.162947 | 14.077758 | 1 | B | 10.388587 |
| 4 | B001-M01-S005 | 1 | 5 | 49.690442 | 102.120283 | 22.248696 | 13.728666 | 1 | B | 10.388587 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14724 | B050-M02-S092 | 50 | 92 | 52.690936 | 95.050804 | 20.324674 | 10.963370 | 2 | B | 9.935676 |
| 14725 | B050-M02-S093 | 50 | 93 | 52.348015 | 97.863144 | 20.884779 | 11.156604 | 2 | B | 9.935676 |
| 14726 | B050-M02-S094 | 50 | 94 | 52.187241 | 99.887013 | 20.439142 | 11.303222 | 2 | B | 9.935676 |
| 14727 | B050-M02-S095 | 50 | 95 | 52.875621 | 101.572057 | 20.648029 | 11.084512 | 2 | B | 9.935676 |
| 14728 | B050-M02-S096 | 50 | 96 | 52.491445 | 96.744458 | 20.814762 | 10.886910 | 2 | B | 9.935676 |
14729 rows × 10 columns
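Because each Batch should map to exactly one supplier row, a merge like the one above can be guarded with `validate='m:1'` and inspected with `indicator=True`. A minimal sketch on invented data:

```python
import pandas as pd

machine = pd.DataFrame({'ID': ['a', 'b', 'c'], 'Batch': [1, 1, 2]})
batch = pd.DataFrame({'Batch': [1, 2], 'Supplier': ['A', 'B']})

merged = pd.merge(machine, batch, on='Batch', how='inner',
                  validate='m:1',   # raises MergeError if Batch is duplicated in batch
                  indicator=True)   # adds a _merge column showing the match source
print(merged['_merge'].value_counts())
```

If every `_merge` value is `'both'` and the row count equals the left frame's, the join neither dropped nor duplicated rows, which is exactly what we see for dfa (14729 rows before and after).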
## 7.1 Summary statistics of x1 through x4 across Batch given Supplier
# one box-plot figure per operational variable, faceted by Supplier
for var in ['x1', 'x2', 'x3', 'x4']:
    sns.catplot(data=dfa, x='Batch', y=var, kind='box', col='Supplier', height=5, aspect=2, col_wrap=2)
<seaborn.axisgrid.FacetGrid at 0x229ff067fd0>
## 7.2 Relationship between x1 through x4 across Batch given Supplier
sns.pairplot(data=dfa, hue='Supplier', vars=['x1','x2','x3','x4'],
diag_kws={'common_norm':False})
<seaborn.axisgrid.PairGrid at 0x229815e72d0>
The DROP TEST result data set file name is provided for you below. You must read in the CSV file and assign the data set to the test_df object.
You must perform the following ESSENTIAL activities:
You must visually explore the MARGINAL behavior of the variables in the data. You must use visualizations appropriate for the DATA TYPE of the columns.
You must visually explore RELATIONSHIPS between variables in the data. You must use visualizations appropriate for the DATA TYPES. You must make sure that your visualizations can answer the following questions:
- Count the number of times each unique value of Result occurs for each test_group_id value.

After exploring the test_df DataFrame, you MUST JOIN/MERGE the test_df DataFrame with the dfa DataFrame. Assign the merged DataFrame to the dfb DataFrame. You MUST answer the following: how many rows remain after the merge with the default joining procedure?
You may add as many markdown and code cells as you see fit to answer this question.
# define the test data set file name
test_file = 'midterm_test.csv'
# read in the test data set
test_df = pd.read_csv(test_file)
test_df
| | ID | test_group_id | Result |
|---|---|---|---|
| 0 | B001-M01-S056 | A-0 | 1 |
| 1 | B001-M01-S002 | A-0 | 1 |
| 2 | B001-M01-S049 | A-0 | 1 |
| 3 | B001-M01-S030 | A-0 | 0 |
| 4 | B001-M01-S048 | A-0 | 1 |
| ... | ... | ... | ... |
| 1407 | B049-M03-S026 | K-5 | 1 |
| 1408 | B049-M03-S008 | K-5 | 1 |
| 1409 | B049-M03-S041 | K-5 | 1 |
| 1410 | B049-M03-S061 | K-5 | 1 |
| 1411 | B049-M03-S051 | K-5 | 1 |
1412 rows × 3 columns
# 1. How many rows and columns are in the data set?
print(f'The test_df has {test_df.shape[0]} rows and {test_df.shape[1]} columns')
The test_df has 1412 rows and 3 columns
# 2. What are the names and data types of the columns in the data set?
print(f'The names and data types of the columns in test_df are \n{test_df.dtypes}')
The names and data types of the columns in test_df are
ID               object
test_group_id    object
Result            int64
dtype: object
# 3. How many unique values are there for each column of the data set?
print(f'The unique values for test_df are \n{test_df.nunique()}')
The unique values for test_df are
ID               1412
test_group_id     141
Result              2
dtype: int64
# 4. How many missing values are there for each column of the data set?
print(f'The missing values for test_df are \n{test_df.isnull().sum()}')
The missing values for test_df are
ID               0
test_group_id    0
Result           0
dtype: int64
# 5. Explore the MARGINAL behavior of the variables in the data.
marginal_test_df = test_df.groupby(['Result']).size().reset_index(name='num_rows')
sns.catplot(data = marginal_test_df, x = 'Result', kind = 'bar', y = 'num_rows')
<seaborn.axisgrid.FacetGrid at 0x2298165e010>
# 6. Explore the RELATIONSHIP between the variables in the data.
## 6.1 Count the number of times each unique value of Result occurs for each test_group_id value
test_df_unique_count = test_df.groupby(['test_group_id','Result']).size().reset_index(name='num_rows')
sns.catplot(data = test_df_unique_count, x = 'test_group_id', col = 'Result', kind = 'bar', y = 'num_rows', col_wrap=1, height=6, aspect=5)
plt.xticks(fontsize=10)
plt.show()
dfa
| | ID | Batch | s_id | x1 | x2 | x3 | x4 | machine_id | Supplier | Density |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S001 | 1 | 1 | 50.117118 | 102.167346 | 22.067812 | 13.889524 | 1 | B | 10.388587 |
| 1 | B001-M01-S002 | 1 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 | B | 10.388587 |
| 2 | B001-M01-S003 | 1 | 3 | 50.132744 | 103.674908 | 22.319933 | 13.647482 | 1 | B | 10.388587 |
| 3 | B001-M01-S004 | 1 | 4 | 48.501042 | 107.143156 | 22.162947 | 14.077758 | 1 | B | 10.388587 |
| 4 | B001-M01-S005 | 1 | 5 | 49.690442 | 102.120283 | 22.248696 | 13.728666 | 1 | B | 10.388587 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14724 | B050-M02-S092 | 50 | 92 | 52.690936 | 95.050804 | 20.324674 | 10.963370 | 2 | B | 9.935676 |
| 14725 | B050-M02-S093 | 50 | 93 | 52.348015 | 97.863144 | 20.884779 | 11.156604 | 2 | B | 9.935676 |
| 14726 | B050-M02-S094 | 50 | 94 | 52.187241 | 99.887013 | 20.439142 | 11.303222 | 2 | B | 9.935676 |
| 14727 | B050-M02-S095 | 50 | 95 | 52.875621 | 101.572057 | 20.648029 | 11.084512 | 2 | B | 9.935676 |
| 14728 | B050-M02-S096 | 50 | 96 | 52.491445 | 96.744458 | 20.814762 | 10.886910 | 2 | B | 9.935676 |
14729 rows × 10 columns
# merge test_df with the dfa object
dfb = pd.merge(dfa, test_df, on='ID')
dfb
| | ID | Batch | s_id | x1 | x2 | x3 | x4 | machine_id | Supplier | Density | test_group_id | Result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S002 | 1 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 | B | 10.388587 | A-0 | 1 |
| 1 | B001-M01-S024 | 1 | 24 | 51.531574 | 100.207219 | 22.281345 | 13.796810 | 1 | B | 10.388587 | A-0 | 1 |
| 2 | B001-M01-S030 | 1 | 30 | 48.306819 | 109.445722 | 22.910858 | 13.839830 | 1 | B | 10.388587 | A-0 | 0 |
| 3 | B001-M01-S038 | 1 | 38 | 52.058573 | 93.272568 | 21.937216 | 13.332882 | 1 | B | 10.388587 | A-0 | 1 |
| 4 | B001-M01-S048 | 1 | 48 | 50.564504 | 109.184587 | 22.083064 | 13.954942 | 1 | B | 10.388587 | A-0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1407 | B050-M02-S033 | 50 | 33 | 52.139310 | 101.869648 | 20.802603 | 10.803831 | 2 | B | 9.935676 | V-3 | 1 |
| 1408 | B050-M02-S064 | 50 | 64 | 52.048377 | 98.445039 | 21.305935 | 11.142095 | 2 | B | 9.935676 | V-3 | 0 |
| 1409 | B050-M02-S065 | 50 | 65 | 52.116809 | 101.042307 | 20.695658 | 11.051621 | 2 | B | 9.935676 | V-3 | 0 |
| 1410 | B050-M02-S073 | 50 | 73 | 52.484297 | 101.426409 | 20.506720 | 11.032949 | 2 | B | 9.935676 | V-3 | 0 |
| 1411 | B050-M02-S092 | 50 | 92 | 52.690936 | 95.050804 | 20.324674 | 10.963370 | 2 | B | 9.935676 | V-3 | 0 |
1412 rows × 12 columns
1412 rows remain with the default (inner) joining procedure, one per drop-tested case.
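The row count after a merge depends on the `how` argument: since only a subset of cases were drop-tested, the default inner join keeps just the tested rows, while a left join would keep all 14729 cases with NaN results for untested ones. A toy illustration:

```python
import pandas as pd

cases = pd.DataFrame({'ID': ['a', 'b', 'c', 'd']})
tests = pd.DataFrame({'ID': ['a', 'c'], 'Result': [1, 0]})

inner = pd.merge(cases, tests, on='ID', how='inner')  # tested cases only
left = pd.merge(cases, tests, on='ID', how='left')    # all cases, NaN where untested
print(len(inner), len(left), left['Result'].isnull().sum())
```

The inner join is appropriate here because the later failure-proportion analysis only makes sense for cases that were actually tested.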
You must now examine the merged dfb object and answer the following:
- Count the number of times each unique value of Result occurs for each value of machine_id.
- Count the number of times each unique value of Result occurs for each value of Supplier.
- Visualize the number of times each unique value of Result occurs per Batch for each value of machine_id.
- Visualize the number of times each unique value of Result occurs per Batch for each value of machine_id and Supplier.
- Calculate the PROPORTION of times the cell phone case failed the test in each Batch per machine_id.
- Visualize the proportion of times the cell phone case failed the test in each Batch per machine_id and for each unique value of Supplier.

HINT: Remember that a FAILED test is encoded as Result == 1. How can you calculate the PROPORTION of times Result == 1?
Add as many cells as you see fit to answer this question.
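Since Result is a 0/1 column with 1 encoding a failure, the failure proportion within any group is simply the group mean of Result. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    'machine_id': [1, 1, 1, 2, 2],
    'Result':     [1, 0, 1, 0, 0],
})
# the mean of a 0/1 column is the proportion of 1s (failures)
fail_prop = toy.groupby('machine_id')['Result'].mean()
print(fail_prop)
```

The same one-liner generalizes to any grouping, e.g. `dfb.groupby(['Batch', 'machine_id'])['Result'].mean()`, and avoids the separate count-and-divide bookkeeping.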
# p4.1 Count the number of times each unique value of Result occurs for each value of machine_id
dfb_unique_count = dfb.groupby(['machine_id','Result']).size().reset_index(name='num_rows')
dfb_unique_count
| | machine_id | Result | num_rows |
|---|---|---|---|
| 0 | 1 | 0 | 333 |
| 1 | 1 | 1 | 159 |
| 2 | 2 | 0 | 342 |
| 3 | 2 | 1 | 147 |
| 4 | 3 | 0 | 311 |
| 5 | 3 | 1 | 120 |
# p4.2 Count the number of times each unique value of Result occurs for each value of Supplier
dfb_unique_count = dfb.groupby(['Supplier','Result']).size().reset_index(name='num_rows')
dfb_unique_count
| | Supplier | Result | num_rows |
|---|---|---|---|
| 0 | A | 0 | 588 |
| 1 | A | 1 | 290 |
| 2 | B | 0 | 398 |
| 3 | B | 1 | 136 |
# p4.3 Visualize the number of times each unique value of Result occurs per Batch for each value of machine_id
sns.catplot(data=dfb, x='Batch', col='machine_id', hue='Result', kind='count', height=4, aspect=3, col_wrap=1)
<seaborn.axisgrid.FacetGrid at 0x229ff1aa7d0>
# p4.4 Visualize the number of times each unique value of Result occurs per Batch for each value of machine_id and Supplier
sns.catplot(data=dfb, x='Batch', row='machine_id', hue='Result', kind='count', height=4, aspect=2.2, col='Supplier')
<seaborn.axisgrid.FacetGrid at 0x229ff8dd190>
# p4.5 Calculate the PROPORTION of times the cell phone case failed the test in each Batch per machine_id
dfb_prop = dfb.groupby(['Batch','machine_id','Result']).size().reset_index(name='num_rows')
dfb_prop
# Result == 1 encodes a FAILED test, so the failure count must filter on Result == 1
dfb_cnt_failed = dfb_prop[dfb_prop['Result'] == 1].groupby(['Batch','machine_id','Result']).agg({'num_rows': 'sum'})
dfb_cnt_passed = dfb_prop[dfb_prop['Result'] == 0].groupby(['Batch','machine_id','Result']).agg({'num_rows': 'sum'})
dfb_cnt_all = dfb_prop.groupby(['Batch','machine_id']).agg({'num_rows': 'sum'})
dfb_prop_failed = dfb_cnt_failed / dfb_cnt_all
dfb_prop_failed.dropna(inplace=True)
dfb_prop_failed
| Batch | machine_id | Result | num_rows |
|---|---|---|---|
| 1 | 1 | 0 | 0.142857 |
| 2 | 1 | 0 | 0.769231 |
| 2 | 2 | 0 | 0.818182 |
| 2 | 3 | 0 | 1.000000 |
| 3 | 1 | 0 | 0.800000 |
| ... | ... | ... | ... |
| 48 | 1 | 0 | 0.818182 |
| 48 | 2 | 0 | 0.600000 |
| 48 | 3 | 0 | 0.818182 |
| 50 | 1 | 0 | 0.900000 |
| 50 | 2 | 0 | 0.888889 |

126 rows × 1 columns
# p4.6 Visualize the proportion of times the cell phone case failed the test in each Batch per machine_id and for each unique value of Supplier
dfb_prop_with_supplier = dfb.groupby(['Batch','machine_id','Supplier','Result']).size().reset_index(name='num_rows')
# again, Result == 1 is a FAILED test
dfb_cnt_failed_with_supplier = dfb_prop_with_supplier[dfb_prop_with_supplier['Result'] == 1].groupby(['Batch','machine_id','Supplier','Result']).agg({'num_rows': 'sum'})
dfb_cnt_passed_with_supplier = dfb_prop_with_supplier[dfb_prop_with_supplier['Result'] == 0].groupby(['Batch','machine_id','Supplier','Result']).agg({'num_rows': 'sum'})
dfb_cnt_all_with_supplier = dfb_prop_with_supplier.groupby(['Batch','machine_id','Supplier']).agg({'num_rows': 'sum'})
dfb_prop_failed_with_supplier = dfb_cnt_failed_with_supplier / dfb_cnt_all_with_supplier
dfb_prop_failed_with_supplier.dropna(inplace=True)
# keep Batch info
dfb_prop_failed_with_supplier_plot = dfb_prop_failed_with_supplier.reset_index()
dfb_prop_failed_with_supplier_plot.rename(columns={'num_rows': 'proportion'}, inplace=True)
sns.catplot(data=dfb_prop_failed_with_supplier_plot, x='Batch', y='proportion', row='machine_id', hue='Supplier', height=4, aspect=2.2)
<seaborn.axisgrid.FacetGrid at 0x229ff3bef90>
You must cluster the rows of dfb using the 4 operational variables x1 through x4. You must decide how many clusters to use and describe how you made that choice. You may use KMeans OR Hierarchical clustering. Include any figures that helped you make that choice.
Visualize your cluster analysis results by:
You are interested in the PROPORTION of cell phone cases that failed the DROP TEST. Are any of the clusters associated with higher failure PROPORTIONS than others? Based on your visualizations how would you describe that cluster?
Add as many cells as you see fit to answer this question.
# 0.Observation of relationship between variables
dfb_cluster = dfb[['x1','x2','x3','x4']]
sns.pairplot(dfb_cluster)
<seaborn.axisgrid.PairGrid at 0x229fe1fe890>
# 1. Convert format
X = dfb_cluster.select_dtypes('number').copy().to_numpy()
# 2. Optimal number of clusters
tots_within=[]
K = range(1, 30)
for k in K:
kmeans = KMeans(n_clusters=k, random_state=123,n_init=25, max_iter=500).fit(X)
tots_within.append(kmeans.inertia_)
fig,ax = plt.subplots(figsize=(6,4))
ax.plot(K, tots_within, 'bo-')
ax.set_xlabel('Number of Clusters')
ax.set_ylabel('Total Within Sum of Squares')
plt.show()
The total within-cluster sum of squares drops steeply up to K = 4 and flattens afterward (the elbow), so 4 clusters would be an appropriate choice.
cluster_4 = KMeans(n_clusters=4, random_state=123,n_init=25, max_iter=500).fit_predict(X)
dfb_cluster_k4 = dfb_cluster.copy()
dfb_cluster_k4['k_4'] = pd.Series(cluster_4, index=dfb_cluster.index).astype('category')
dfb_cluster_k4
## Plot the number of observations in each cluster
sns.catplot(data=dfb_cluster_k4, x='k_4', kind='count')
## Visualize the relationships between the operational variables GIVEN the cluster
sns.pairplot(dfb_cluster_k4, hue='k_4', diag_kws={'common_norm':False})
<seaborn.axisgrid.PairGrid at 0x2298d01b050>
## Hierarchical Clustering
# 1.Standardize the data
X_dfb_features = dfb_cluster.select_dtypes('number').copy()
X_dfb = StandardScaler().fit_transform(X_dfb_features)
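Standardizing matters here because x2 is on a much larger scale than x4 and would otherwise dominate the distance calculations. StandardScaler gives every column mean 0 and unit variance, as this toy check confirms:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two columns on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 110.0],
              [3.0, 120.0]])
Z = StandardScaler().fit_transform(X)
# each column now has mean ~0 and (population) standard deviation ~1
print(Z.mean(axis=0), Z.std(axis=0))
```

Note the K-means fit above used the unscaled X; rerunning it on the standardized features would make the two clustering approaches directly comparable.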
### Default cluster
hclust_complete = hierarchy.complete(X_dfb)
sns.set_style('white')
fig = plt.subplots(figsize=(12,6))
dn = hierarchy.dendrogram(hclust_complete, no_labels=True)
plt.show()
### Single Linkage
hclust_single = hierarchy.single(X_dfb)
sns.set_style('white')
fig = plt.subplots(figsize=(12,6))
dn = hierarchy.dendrogram(hclust_single, no_labels=True)
plt.show()
### Average Linkage
hclust_average = hierarchy.average(X_dfb)
sns.set_style('white')
fig = plt.subplots(figsize=(12,6))
dn = hierarchy.dendrogram(hclust_average, no_labels=True)
plt.show()
### Centroid Linkage
hclust_centroid = hierarchy.centroid(X_dfb)
sns.set_style('white')
fig = plt.subplots(figsize=(12,6))
dn = hierarchy.dendrogram(hclust_centroid, no_labels=True)
plt.show()
### Ward Linkage
hclust_ward = hierarchy.ward(X_dfb)
sns.set_style('white')
fig = plt.subplots(figsize=(12,6))
dn = hierarchy.dendrogram(hclust_ward, no_labels=True)
plt.show()
All of the hierarchical clustering linkages produce dendrograms consistent with the K-means result, suggesting about 4 clusters in the data set. We therefore continue using dfb_cluster_k4.
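To actually obtain flat labels from a dendrogram, the tree must be cut with `hierarchy.fcluster`; the notebook proceeds with the K-means labels instead, but the cut would look like this on toy data:

```python
import numpy as np
from scipy.cluster import hierarchy

# two well-separated toy groups of three points each
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
link = hierarchy.ward(X)
labels = hierarchy.fcluster(link, t=2, criterion='maxclust')  # cut into 2 flat clusters
print(labels)
```

With `criterion='maxclust'` the threshold `t` is the desired number of clusters, so `t=4` on `hclust_ward` would yield labels comparable to the K-means `k_4` column.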
## Plot clusters and failure proportions
# left-join the cluster labels back onto dfb; with no 'on' argument the merge keys
# default to the shared columns (x1-x4), whose float values are unique per row here
df_cluster_result = pd.merge(dfb, dfb_cluster_k4, how='left')
df_cluster_result
| | ID | Batch | s_id | x1 | x2 | x3 | x4 | machine_id | Supplier | Density | test_group_id | Result | k_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S002 | 1 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 | B | 10.388587 | A-0 | 1 | 3 |
| 1 | B001-M01-S024 | 1 | 24 | 51.531574 | 100.207219 | 22.281345 | 13.796810 | 1 | B | 10.388587 | A-0 | 1 | 0 |
| 2 | B001-M01-S030 | 1 | 30 | 48.306819 | 109.445722 | 22.910858 | 13.839830 | 1 | B | 10.388587 | A-0 | 0 | 0 |
| 3 | B001-M01-S038 | 1 | 38 | 52.058573 | 93.272568 | 21.937216 | 13.332882 | 1 | B | 10.388587 | A-0 | 1 | 2 |
| 4 | B001-M01-S048 | 1 | 48 | 50.564504 | 109.184587 | 22.083064 | 13.954942 | 1 | B | 10.388587 | A-0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1407 | B050-M02-S033 | 50 | 33 | 52.139310 | 101.869648 | 20.802603 | 10.803831 | 2 | B | 9.935676 | V-3 | 1 | 0 |
| 1408 | B050-M02-S064 | 50 | 64 | 52.048377 | 98.445039 | 21.305935 | 11.142095 | 2 | B | 9.935676 | V-3 | 0 | 2 |
| 1409 | B050-M02-S065 | 50 | 65 | 52.116809 | 101.042307 | 20.695658 | 11.051621 | 2 | B | 9.935676 | V-3 | 0 | 0 |
| 1410 | B050-M02-S073 | 50 | 73 | 52.484297 | 101.426409 | 20.506720 | 11.032949 | 2 | B | 9.935676 | V-3 | 0 | 0 |
| 1411 | B050-M02-S092 | 50 | 92 | 52.690936 | 95.050804 | 20.324674 | 10.963370 | 2 | B | 9.935676 | V-3 | 0 | 2 |
1412 rows × 13 columns
# count plot between k_4 and Result
sns.catplot(data=df_cluster_result, x='k_4', hue='Result', kind='count')
<seaborn.axisgrid.FacetGrid at 0x2298aa74310>
Since Result == 1 encodes a FAILED test, passing cases (Result == 0) actually outnumber failures in all 4 clusters; cluster 3 stands out with the highest failure proportion of the four.